{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "T3l-XPuhit0_"
   },
   "source": [
    "# Cross Encoder Reranker\n",
    "\n",
    "- Author: [JeongHo Shin](https://github.com/ThePurpleCollar)\n",
    "- Peer Review:\n",
    "- Proofread : [JaeJun Shim](https://github.com/kkam-dragon)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/11-Reranker/01-CrossEncoderReranker.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/11-Reranker/01-CrossEncoderReranker.ipynb)\n",
    "## Overview\n",
    "\n",
    "The **Cross Encoder Reranker** is a technique designed to enhance the performance of **Retrieval-Augmented Generation (RAG)** systems.<br>This guide explains how to implement a reranker using Hugging Face's Cross Encoders to refine the ranking of retrieved documents, promoting those most relevant to a query.\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [Key Features and Mechanism](#key-features-and-mechanism)\n",
    "- [Practical Applications](#practical-applications)\n",
    "- [Implementation](#implementation)\n",
    "- [Key Advantages of Reranker](#key-advantages-of-reranker)\n",
    "- [Document Count Settings for Reranker](#document-count-settings-for-reranker)\n",
    "- [Trade-offs When Using a Reranker](#trade-offs-when-using-a-reranker)\n",
    "\n",
    "### References\n",
    "\n",
    "[Hugging Face Cross Encoders](https://huggingface.co/cross-encoder)\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can alternatively set OPENAI_API_KEY in .env file and load it.\n",
    "\n",
    "[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration file to manage API keys as environment variables\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "# Load API key information\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Features and Mechanism\n",
    "\n",
    "### Purpose\n",
    "- Re-rank retrieved documents to refine their ranking, prioritizing the most relevant results for the query.\n",
    "\n",
    "### Structure\n",
    "- Accepts both the **query** and **document** as a single input pair, enabling joint processing.\n",
    "\n",
    "### Mechanism\n",
    "- **Single Input Pair** :  \n",
    "  Processes the **query** and **document** as a combined input to output a relevance score directly.\n",
    "- **Self-Attention Mechanism** :  \n",
    "  Uses self-attention to jointly analyze the **query** and **document** , effectively capturing their semantic relationship.\n",
    "\n",
    "### Advantages\n",
    "- **Higher Accuracy** :  \n",
    "  Provides more precise similarity scores.\n",
    "- **Deep Contextual Analysis** :  \n",
    "  Explores semantic nuances between **query** and **document** .\n",
    "\n",
    "### Limitations\n",
    "- **High Computational Costs** :  \n",
    "  Processing can be time-intensive.\n",
    "- **Scalability Issues** :  \n",
    "  Not suitable for large-scale document collections without optimization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practical Applications\n",
    "- A **Bi-Encoder** quickly retrieves candidate **documents** by computing lightweight similarity scores.  \n",
    "- A **Cross Encoder** refines these results by deeply analyzing the semantic relationship between the **query** and the retrieved **documents** ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementation\n",
    "- Use Hugging Face cross encoders or ```BAAI/bge-reranker``` models.\n",
    "- Easily integrate with frameworks like **LangChain** through the ```CrossEncoderReranker``` component."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Advantages of Reranker\n",
    "- **Precise Similarity Scoring** :  \n",
    "  Delivers highly accurate measurements of relevance between the **query** and **documents**.\n",
    "- **Semantic Depth** :  \n",
    "  Analyzes deeper semantic relationships, uncovering nuances in **query** - **document** interactions.\n",
    "- **Refined Search Quality** :  \n",
    "  Improves the relevance and quality of the retrieved **documents** .\n",
    "- **RAG System Boost** :  \n",
    "  Enhances the performance of **Retrieval-Augmented Generation (RAG)** systems by refining input relevance.\n",
    "- **Seamless Integration** :  \n",
    "  Easily adaptable to various workflows and compatible with multiple frameworks.\n",
    "- **Model Versatility** :  \n",
    "  Offers flexibility with a wide range of pre-trained models for tailored use cases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Document Count Settings for Reranker\n",
    "- Reranking is generally performed on the top **5–10** **documents** retrieved during the initial search.\n",
    "- The ideal number of **documents** for reranking should be determined through experimentation and evaluation, as it depends on the dataset characteristics and computational resources available."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs When Using a Reranker\n",
    "- **Accuracy vs Processing Time** :  \n",
    "  Striking a balance between achieving higher accuracy and minimizing processing time.\n",
    "- **Performance Improvement vs Computational Cost** :  \n",
    "  Weighing the benefits of improved performance against the additional computational resources required.\n",
    "- **Search Speed vs Relevance Accuracy** :  \n",
    "  Managing the trade-off between faster retrieval and maintaining high relevance in results.\n",
    "- **System Requirements** :  \n",
    "  Ensuring the system meets the necessary hardware and software requirements to support reranking.\n",
    "- **Dataset Characteristics** :  \n",
    "  Considering the scale, diversity, and specific attributes of the dataset to optimize reranker performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tk9Sg0AKi76R"
   },
   "source": [
    "Explaining the Implementation of **Cross Encoder Reranker** with a Simple Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "_DlGOaxxozWp"
   },
   "outputs": [],
   "source": [
    "# Helper function to format and print document content\n",
    "def pretty_print_docs(docs):\n",
    "    # Print each document in the list with a separator between them\n",
    "    print(\n",
    "        f\"\\n{'-' * 100}\\n\".join(  # Separator line for better readability\n",
    "            [f\"Document {i+1}:\\n\\n\" + d.page_content for i, d in enumerate(docs)]  # Format: Document number + content\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "obj9TBnHjFBt",
    "outputId": "f3a0ecd4-e9ba-4bcf-a509-ef38631be18a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "Word2Vec\n",
      "Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n",
      "Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n",
      "Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 2:\n",
      "\n",
      "Token\n",
      "Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence.\n",
      "Example: The sentence \"I go to school\" can be tokenized into \"I,\" \"go,\" \"to,\" and \"school.\"\n",
      "Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 3:\n",
      "\n",
      "Example: A customer information table in a relational database is an example of structured data.\n",
      "Related Keywords: Database, Data Analysis, Data Modeling\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 4:\n",
      "\n",
      "Schema\n",
      "Definition: A schema defines the structure of a database or file, detailing how data is organized and stored.\n",
      "Example: A relational database schema specifies column names, data types, and key constraints.\n",
      "Related Keywords: Database, Data Modeling, Data Management\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 5:\n",
      "\n",
      "Keyword Search\n",
      "Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems.\n",
      "Example: Searching \n",
      "When a user searches for \"coffee shops in Seoul,\" the system returns a list of relevant coffee shops.\n",
      "Related Keywords: Search Engine, Data Search, Information Retrieval\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 6:\n",
      "\n",
      "TF-IDF (Term Frequency-Inverse Document Frequency)\n",
      "Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n",
      "Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n",
      "Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 7:\n",
      "\n",
      "SQL\n",
      "Definition: SQL (Structured Query Language) is a programming language for managing data in databases. \n",
      "It allows you to perform various operations such as querying, updating, inserting, and deleting data.\n",
      "Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18.\n",
      "Related Keywords: Database, Query, Data Management\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 8:\n",
      "\n",
      "Open Source\n",
      "Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n",
      "Example: The Linux operating system is a well-known open source project.\n",
      "Related Keywords: Software Development, Community, Technical Collaboration\n",
      "Structured Data\n",
      "Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 9:\n",
      "\n",
      "Semantic Search\n",
      "Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant.\n",
      "Example: If a user searches for \"planets in the solar system,\" the system provides information about planets like Jupiter and Mars.\n",
      "Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 10:\n",
      "\n",
      "GPT (Generative Pretrained Transformer)\n",
      "Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input.\n",
      "Example: A chatbot generating detailed answers to user queries is powered by GPT models.\n",
      "Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_huggingface import HuggingFaceEmbeddings\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Load documents\n",
    "documents = TextLoader(\"./data/appendix-keywords.txt\").load()\n",
    "\n",
    "# Configure text splitter\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)\n",
    "\n",
    "# Split documents into chunks\n",
    "texts = text_splitter.split_documents(documents)\n",
    "\n",
    "# # Set up the embedding model\n",
    "embeddings_model = HuggingFaceEmbeddings(\n",
    "    model_name=\"sentence-transformers/msmarco-distilbert-dot-v5\",\n",
    "    model_kwargs = {\"tokenizer_kwargs\": {\"clean_up_tokenization_spaces\": False}}\n",
    ")\n",
    "\n",
    "# Create FAISS index from documents and set up retriever\n",
    "retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(\n",
    "    search_kwargs={\"k\": 10}\n",
    ")\n",
    "\n",
    "# Define the query\n",
    "query = \"Can you tell me about Word2Vec?\"\n",
    "\n",
    "# Execute the query and retrieve results\n",
    "docs = retriever.invoke(query)\n",
    "\n",
    "# Display the retrieved documents\n",
    "pretty_print_docs(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "T_6qFjxQpUc3"
   },
   "source": [
    "Now, let's wrap the ```base_retriever``` with a ```ContextualCompressionRetriever``` . The ```CrossEncoderReranker``` leverages ```HuggingFaceCrossEncoder``` to re-rank the retrieved results.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_D3uj-DOpwUt"
   },
   "source": [
    "Multilingual Support BGE Reranker: [```bge-reranker-v2-m3```](https://huggingface.co/BAAI/bge-reranker-v2-m3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "f5yrJ-FTjWJB",
    "outputId": "ddfebce9-e45a-4057-cadc-2767d7c26152"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document 1:\n",
      "\n",
      "Word2Vec\n",
      "Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n",
      "Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n",
      "Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 2:\n",
      "\n",
      "Open Source\n",
      "Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n",
      "Example: The Linux operating system is a well-known open source project.\n",
      "Related Keywords: Software Development, Community, Technical Collaboration\n",
      "Structured Data\n",
      "Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n",
      "----------------------------------------------------------------------------------------------------\n",
      "Document 3:\n",
      "\n",
      "TF-IDF (Term Frequency-Inverse Document Frequency)\n",
      "Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n",
      "Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n",
      "Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n"
     ]
    }
   ],
   "source": [
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "from langchain.retrievers.document_compressors import CrossEncoderReranker\n",
    "from langchain_community.cross_encoders import HuggingFaceCrossEncoder\n",
    "\n",
    "# Initialize the model\n",
    "model = HuggingFaceCrossEncoder(model_name=\"BAAI/bge-reranker-v2-m3\")\n",
    "\n",
    "# Select the top 3 documents\n",
    "compressor = CrossEncoderReranker(model=model, top_n=3)\n",
    "\n",
    "# Initialize the contextual compression retriever\n",
    "compression_retriever = ContextualCompressionRetriever(\n",
    "    base_compressor=compressor, base_retriever=retriever\n",
    ")\n",
    "\n",
    "# Retrieve compressed documents\n",
    "compressed_docs = compression_retriever.invoke(\"Can you tell me about Word2Vec?\")\n",
    "\n",
    "# Display the documents\n",
    "pretty_print_docs(compressed_docs)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "langchain-opentutorial-W8hXPYms-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
